A Semiautomatic Approach for Accurate and Low-Effort Spreadsheet Data Extraction

Authors

  • Zhe Chen
  • Michael Cafarella
Abstract

Spreadsheets contain valuable data on many topics, but they are difficult to integrate with other sources. Converting spreadsheet data to the relational model would allow relational integration tools to be used, but doing this manually requires a large amount of work for each integration candidate. Automatic data extraction would be useful, but it is very challenging: spreadsheet designs generally require human knowledge to understand the metadata being described. Even if this metadata can be obtained automatically, a single mistake can yield an output relation with a huge number of incorrect tuples. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on conditional random fields (CRFs), our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair component identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. By integrating the repair workflow into the extraction system, a human can obtain an accurate extraction with just 31% of the manual operations required by a standard classification-based technique. We demonstrate and evaluate our system using two corpora: more than 1,000 spreadsheets published by the US government and more than 400,000 spreadsheets downloaded from the Web.
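The idea of amortizing one manual repair over many similar extraction errors can be illustrated with a minimal sketch. This is not the authors' CRF-based method: it assumes a much simpler stand-in in which two rows count as "similar" when they share an exact style signature. The row fields (`text`, `indent`, `bold`), the label names, and the functions `signature` and `propagate_repair` are all hypothetical and chosen only for illustration.

```python
def signature(row):
    """Hypothetical style signature: (indent level, bold flag, is-numeric flag)."""
    is_numeric = row["text"].replace(".", "", 1).isdigit()
    return (row["indent"], row["bold"], is_numeric)


def propagate_repair(rows, labels, repaired_index, new_label):
    """Apply one manual repair to every row that has the same signature
    and the same (presumably wrong) label as the repaired row."""
    target = signature(rows[repaired_index])
    old_label = labels[repaired_index]
    updated = list(labels)
    changed = []
    for i, row in enumerate(rows):
        if signature(row) == target and labels[i] == old_label:
            updated[i] = new_label
            changed.append(i)
    return updated, changed


# Toy rows from one spreadsheet column (assumed data, not from the paper).
rows = [
    {"text": "State", "indent": 0, "bold": True},
    {"text": "Northeast", "indent": 1, "bold": False},
    {"text": "1200", "indent": 2, "bold": False},
    {"text": "Midwest", "indent": 1, "bold": False},
    {"text": "3400", "indent": 2, "bold": False},
]
# Labels from an imagined automatic extractor; rows 1 and 3 are mislabeled.
labels = ["header", "data", "data", "data", "data"]

# The user fixes row 1 once; row 3 is repaired automatically.
updated, changed = propagate_repair(rows, labels, 1, "attribute")
```

In this toy run a single user action corrects two rows. The system described in the abstract goes further by finding similar regions across distinct spreadsheets in a corpus, which exact signature matching within one sheet does not attempt.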

Similar resources

A Novel Approach to Background Subtraction Using Visual Saliency Map

Generally, the human vision system searches for salient regions and movements in video scenes to reduce the search space and effort. Using a visual saliency map for modelling gives important information for understanding in many applications. In this paper we present a simple method with low computational load that uses a visual saliency map for background subtraction in video streams. The proposed technique i...

Semi-automatic Extraction of Houses with Multi Right Angles from Aerial Images

The paper presents a general paradigm for semiautomatic house extraction from an aerial stereo image pair. In the semiautomatic extraction system, the house model is defined as one whose roof has multiple right angles; then, given knowledge of the roof type, low-level and mid-level processing including edge detection, straight line segment extraction and obtaining initial house corners are used to ...

PREDICTION OF LOAD DEFLECTION BEHAVIOUR OF TWO WAY RC SLAB USING NEURAL NETWORK APPROACH

Reinforced concrete (RC) slabs exhibit complexities in their structural behavior under load due to the composite nature of the material and the multitude and variety of factors that affect such behavior. Current methods for determining the load-deflection behavior of reinforced concrete slabs are limited in scope and are mostly dependent on the results of experimental tests. In this study, an ...

Semiautomatic Generation of Data-Extraction Ontologies from Relational Databases

Data extraction is the process used to gather and structure information in documents (e.g., Web pages). One approach to data extraction is so-called ontology-based data extraction. In this approach, an ontology is used as a guide to the parser that extracts data from the source documents. In this context, an ontology is a conceptual schema enriched with information needed to identify data ite...

Improvement of effort estimation accuracy in software projects using a feature selection approach

In recent years, the use of feature selection techniques has become an essential requirement for processing and model construction in different scientific areas. In the field of software project effort estimation, applying dimensionality reduction and feature selection methods has become unavoidable. The high volumes of data, costs, and time necessary for gathering data, ...

Publication date: 2014